Data …(don’t wait for it)… .table!

Carlos Bort & Jordi Puigdellivol

2017-02-27

Outline

  1. Introduction to data.table package

    • What this talk it is / and it is not about
    • Motivation
  2. Basics data.table.

    • The i,j,by syntax. + Exercice
  3. Intermediate data.table.

    • .SD, multiple assign, setkey, setnames, %between%, .GRP, .BY, + Exercice.
  4. Advanced data.table and further steps

    • Shift, set, .EACHI, margarittr, merge, pattern: like, + Exercice.
    • Others. fread(), melt(), dcast(), IDateTime(), fwrite()

Objective

Code: https://www.dropbox.com/s/wdf45ujyktk30f9/datatable_talk_code.R?dl=0

What this talk it is about

What this talk it is NOT about

Data.table to the rescue!

Basic R programming code it is a bit verbose… (A BIT), right? And sometimes you have to wait for it….

Creating columns, filtering, grouping by… That’s why…

https://github.com/Rdatatable/data.table

Step 1, install the package:

#install.packages("data.table")
library(data.table)

What is a data.table?

What does this package so exceptional?

How I transform a data.frame into a data.table?

data("diamonds",package = "ggplot2")

setDT()

class(diamonds)
## [1] "tbl_df"     "tbl"        "data.frame"
setDT(diamonds)
class(diamonds)
## [1] "data.table" "data.frame"

Better than…

diamonds<-as.data.frame(diamonds)
data("flights",package = "nycflights13" )

Basics of data.table the i,j,by syntax

Sintaxis:

As you saw in the course, understanding data.table is like understanding SQL.

Idea: Three posible separations and two comas:

First separation, i, (where):

Subset the elements i, by its row number or condition:

# Select the first three columns
diamonds[1:3]
##    carat     cut color clarity depth table price    x    y    z
## 1:  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2:  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3:  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31
# All cuts are Ideal :)
diamonds[cut=="Ideal"][1:2]
##    carat   cut color clarity depth table price    x    y    z
## 1:  0.23 Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2:  0.23 Ideal     J     VS1  62.8    56   340 3.93 3.90 2.46

Extract column as a vector

# By name
diamonds[["cut"]][1:10]
##  [1] Ideal     Premium   Good      Premium   Good      Very Good Very Good
##  [8] Very Good Fair      Very Good
## Levels: Fair < Good < Very Good < Premium < Ideal
# By number
diamonds[,1][1:10]
##     carat
##  1:  0.23
##  2:  0.21
##  3:  0.23
##  4:  0.29
##  5:  0.31
##  6:  0.24
##  7:  0.24
##  8:  0.26
##  9:  0.22
## 10:  0.23
# For multiple columns
diamonds[,1:2][1:10]
##     carat       cut
##  1:  0.23     Ideal
##  2:  0.21   Premium
##  3:  0.23      Good
##  4:  0.29   Premium
##  5:  0.31      Good
##  6:  0.24 Very Good
##  7:  0.24 Very Good
##  8:  0.26 Very Good
##  9:  0.22      Fair
## 10:  0.23 Very Good

Second separation, j, (select):

For all the observations selected on i calculate:

diamonds[,sum(depth)]
## [1] 3330763

Third separation, by, (group by):

For all the selected on i do the calculus j, by these different groups:

diamonds[,mean(carat),by=cut]
##          cut        V1
## 1:     Ideal 0.7028370
## 2:   Premium 0.8919549
## 3:      Good 0.8491847
## 4: Very Good 0.8063814
## 5:      Fair 1.0461366
# Only for a subset
diamonds[color %in% c("E","H"),mean(carat),by=cut]
##          cut        V1
## 1:     Ideal 0.6765489
## 2:   Premium 0.8678284
## 3:      Good 0.8179511
## 4: Very Good 0.7797940
## 5:      Fair 1.0650664

.N

Counts the number of elements per item. Couting it’s a calculus, j separator.

diamonds[,.N,by=cut]
##          cut     N
## 1:     Ideal 21551
## 2:   Premium 13791
## 3:      Good  4906
## 4: Very Good 12082
## 5:      Fair  1610

:=, Creating a colum by reference

To create a column you do not have to rewrite your object, it is add by reference. LHS, RHS.

# Creating a carat * depth column. One at a time
diamonds[,carat_depth:=carat*depth]
##        carat       cut color clarity depth table price    x    y    z
##     1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
##     2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
##     3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
##     4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
##     5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
##    ---                                                               
## 53936:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64
##        carat_depth
##     1:      14.145
##     2:      12.558
##     3:      13.087
##     4:      18.096
##     5:      19.623
##    ---            
## 53936:      43.776
## 53937:      45.432
## 53938:      43.960
## 53939:      52.460
## 53940:      46.650
# Or more than one column
diamonds[,`:=`(carat_depth = carat*depth,carat_depth_2 = carat*depth, carat_depth_3 = carat*depth )]
##        carat       cut color clarity depth table price    x    y    z
##     1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
##     2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
##     3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
##     4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
##     5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
##    ---                                                               
## 53936:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64
##        carat_depth carat_depth_2 carat_depth_3
##     1:      14.145        14.145        14.145
##     2:      12.558        12.558        12.558
##     3:      13.087        13.087        13.087
##     4:      18.096        18.096        18.096
##     5:      19.623        19.623        19.623
##    ---                                        
## 53936:      43.776        43.776        43.776
## 53937:      45.432        45.432        45.432
## 53938:      43.960        43.960        43.960
## 53939:      52.460        52.460        52.460
## 53940:      46.650        46.650        46.650
# Or more than one column
diamonds[,`:=`(carat_depth = carat*depth,carat_depth_2 = carat*depth, carat_depth_3 = carat*depth )]
##        carat       cut color clarity depth table price    x    y    z
##     1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
##     2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
##     3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
##     4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
##     5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
##    ---                                                               
## 53936:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64
##        carat_depth carat_depth_2 carat_depth_3
##     1:      14.145        14.145        14.145
##     2:      12.558        12.558        12.558
##     3:      13.087        13.087        13.087
##     4:      18.096        18.096        18.096
##     5:      19.623        19.623        19.623
##    ---                                        
## 53936:      43.776        43.776        43.776
## 53937:      45.432        45.432        45.432
## 53938:      43.960        43.960        43.960
## 53939:      52.460        52.460        52.460
## 53940:      46.650        46.650        46.650

Subset and rewrite!

diamonds[cut=="Fair",carat_depth:=0]
##        carat       cut color clarity depth table price    x    y    z
##     1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
##     2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
##     3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
##     4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
##     5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
##    ---                                                               
## 53936:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64
##        carat_depth carat_depth_2 carat_depth_3
##     1:      14.145        14.145        14.145
##     2:      12.558        12.558        12.558
##     3:      13.087        13.087        13.087
##     4:      18.096        18.096        18.096
##     5:      19.623        19.623        19.623
##    ---                                        
## 53936:      43.776        43.776        43.776
## 53937:      45.432        45.432        45.432
## 53938:      43.960        43.960        43.960
## 53939:      52.460        52.460        52.460
## 53940:      46.650        46.650        46.650
diamonds[cut=="Fair",sum(carat_depth)]
## [1] 0

:= NULL, deleting columns

Remove inmediatly without taking into account the size of data.table.

# Delete one
diamonds[,carat_depth_3:=NULL]
##        carat       cut color clarity depth table price    x    y    z
##     1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
##     2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
##     3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
##     4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
##     5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
##    ---                                                               
## 53936:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64
##        carat_depth carat_depth_2
##     1:      14.145        14.145
##     2:      12.558        12.558
##     3:      13.087        13.087
##     4:      18.096        18.096
##     5:      19.623        19.623
##    ---                          
## 53936:      43.776        43.776
## 53937:      45.432        45.432
## 53938:      43.960        43.960
## 53939:      52.460        52.460
## 53940:      46.650        46.650
# More than one
my_columns<-c("carat_depth","carat_depth_2")
diamonds[,(my_columns):=NULL]
##        carat       cut color clarity depth table price    x    y    z
##     1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
##     2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
##     3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
##     4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
##     5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
##    ---                                                               
## 53936:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

Multiple aggregations, multiple j, by multiple by

You can creat multiple agregations with list(newcolumn_1=sum(x),newcolumn_2=sd(x),...)

# You can change list() for . in the next example ;)
diamonds[, list(mean_carat=mean(carat),sd_carat=sd(carat)), by=list(cut,color)][1:10]
##           cut color mean_carat  sd_carat
##  1:     Ideal     E  0.5784012 0.3125406
##  2:   Premium     E  0.7177450 0.4097847
##  3:      Good     E  0.7451340 0.3808900
##  4:   Premium     I  1.1449370 0.6136041
##  5:      Good     J  1.0995440 0.5371248
##  6: Very Good     J  1.1332153 0.5559197
##  7: Very Good     I  1.0469518 0.5519840
##  8: Very Good     H  0.9159485 0.5029465
##  9:      Fair     E  0.8566071 0.3645848
## 10:     Ideal     J  1.0635938 0.5821001

The Chaining

You can combine different data.table instruction by chaining theme with the square brackets.

diamonds[, .(mean_carat=mean(carat),sd_carat=sd(carat)), by=.(cut,color)][order(-cut)][1:10]
##         cut color mean_carat  sd_carat
##  1:   Ideal     E  0.5784012 0.3125406
##  2:   Ideal     J  1.0635938 0.5821001
##  3:   Ideal     I  0.9130291 0.5537277
##  4:   Ideal     G  0.7007146 0.4106182
##  5:   Ideal     D  0.5657657 0.2993503
##  6:   Ideal     F  0.6558285 0.3745245
##  7:   Ideal     H  0.7995249 0.4868741
##  8: Premium     E  0.7177450 0.4097847
##  9: Premium     I  1.1449370 0.6136041
## 10: Premium     F  0.8270356 0.4201959

setcolorder()

Change columns order by reference.

colnames(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
# Reverse the column names by reference
setcolorder(diamonds,rev(colnames(diamonds)))
colnames(diamonds)
##  [1] "z"       "y"       "x"       "price"   "table"   "depth"   "clarity"
##  [8] "color"   "cut"     "carat"

Let’s do some exercices!

Data.table/Basic_Exercice

data("flights",package = "nycflights13" )

Flight columns:

Exercice

1.- Transform flights into a data.table (setDT())

## [1] "tbl_df"     "tbl"        "data.frame"
## [1] "data.table" "data.frame"

2.- How many where destinated to Miami?

## [1] 11728

3.- Calculate the total days of air_time (has NA’s). (pssst: lubridate:::seconds_to_period() )

## [1] 34254.59
## [1] "34254d 14H 10M 0S"

4.- Mean difference by origin airport.

##    origin       V1
## 1:    EWR 15.10795
## 2:    LGA 10.34688
## 3:    JFK 12.11216

5.- Create a new data.table delays_month with mean, sd, and number of flights per month. (Same column names!)

##     month mean_delay sd_delay n_flights
##  1:     1  10.036665 36.39031     27004
##  2:    10   6.243988 29.67176     28889
##  3:    11   5.435362 27.58836     27268
##  4:    12  16.576688 41.87681     28135
##  5:     2  10.816843 36.26655     24951
##  6:     3  13.227076 40.13097     28834
##  7:     4  13.938038 42.96626     28330
##  8:     5  12.986859 39.35283     28796
##  9:     6  20.846332 51.45694     28243
## 10:     7  21.727787 51.61608     29425
## 11:     8  12.611040 37.66692     29327
## 12:     9   6.722476 35.61480     27574

6.- Create a new binary column named not_delayed when dep_delay <0 , equals 1. (psssst: ifelse()).

##         year month day dep_time sched_dep_time dep_delay arr_time
##      1: 2013     1   1      517            515         2      830
##      2: 2013     1   1      533            529         4      850
##      3: 2013     1   1      542            540         2      923
##      4: 2013     1   1      544            545        -1     1004
##      5: 2013     1   1      554            600        -6      812
##     ---                                                          
## 336772: 2013     9  30       NA           1455        NA       NA
## 336773: 2013     9  30       NA           2200        NA       NA
## 336774: 2013     9  30       NA           1210        NA       NA
## 336775: 2013     9  30       NA           1159        NA       NA
## 336776: 2013     9  30       NA            840        NA       NA
##         sched_arr_time arr_delay carrier flight tailnum origin dest
##      1:            819        11      UA   1545  N14228    EWR  IAH
##      2:            830        20      UA   1714  N24211    LGA  IAH
##      3:            850        33      AA   1141  N619AA    JFK  MIA
##      4:           1022       -18      B6    725  N804JB    JFK  BQN
##      5:            837       -25      DL    461  N668DN    LGA  ATL
##     ---                                                            
## 336772:           1634        NA      9E   3393      NA    JFK  DCA
## 336773:           2312        NA      9E   3525      NA    LGA  SYR
## 336774:           1330        NA      MQ   3461  N535MQ    LGA  BNA
## 336775:           1344        NA      MQ   3572  N511MQ    LGA  CLE
## 336776:           1020        NA      MQ   3531  N839MQ    LGA  RDU
##         air_time distance hour minute           time_hour not_delayed
##      1:      227     1400    5     15 2013-01-01 05:00:00           0
##      2:      227     1416    5     29 2013-01-01 05:00:00           0
##      3:      160     1089    5     40 2013-01-01 05:00:00           0
##      4:      183     1576    5     45 2013-01-01 05:00:00           1
##      5:      116      762    6      0 2013-01-01 06:00:00           1
##     ---                                                              
## 336772:       NA      213   14     55 2013-09-30 14:00:00          NA
## 336773:       NA      198   22      0 2013-09-30 22:00:00          NA
## 336774:       NA      764   12     10 2013-09-30 12:00:00          NA
## 336775:       NA      419   11     59 2013-09-30 11:00:00          NA
## 336776:       NA      431    8     40 2013-09-30 08:00:00          NA

7.- Calculate the mean of not delayed flights by company. Use the not_delayed column and name it as perc_not_delayed! Use the chaining to order by perc_not_delayed. (Add .N column)

##     carrier perc_not_delayed     N
##  1:      HA        0.7573099   342
##  2:      US        0.7276707 20536
##  3:      OO        0.6896552    32
##  4:      AS        0.6432584   714
##  5:      AA        0.6332845 32729
##  6:      MQ        0.6288996 26397
##  7:      DL        0.6208831 48110
##  8:      9E        0.5579927 18460
##  9:      B6        0.5529362 54635
## 10:      YV        0.5394495   601
## 11:      EV        0.5171353 54173
## 12:      VX        0.4862600  5162
## 13:      UA        0.4712223 58665
## 14:      FL        0.4395984  3260
## 15:      F9        0.4354839   685
## 16:      WN        0.3755690 12275

8.- Load the airlines dataset. With one line of code, filter the airlines dataset with the previous 10 best airlines. (ppssst: airlines[column %in% super_chunk_of_code])

data("airlines",package = "nycflights13" )
setDT(airlines)
##     carrier                   name
##  1:      9E      Endeavor Air Inc.
##  2:      AA American Airlines Inc.
##  3:      AS   Alaska Airlines Inc.
##  4:      B6        JetBlue Airways
##  5:      DL   Delta Air Lines Inc.
##  6:      HA Hawaiian Airlines Inc.
##  7:      MQ              Envoy Air
##  8:      OO  SkyWest Airlines Inc.
##  9:      US        US Airways Inc.
## 10:      YV     Mesa Airlines Inc.

Honolulu-R-Users-Group :_)

For the bRave

9.- Use the previous delays_month data.table and plot the mean delays with the sd delays. Try to use ggplot2 and ggrepel (for the labels). (+ info, google)

Linear fit!

10.- Replicate this plot.

##      month carrier mean_delay
##   1:     1      UA   8.326167
##   2:     1      AA   6.932358
##   3:     1      B6   9.493436
##   4:     1      DL   3.849768
##   5:     1      EV  24.228879
##  ---                         
## 181:     9      9E   7.754232
## 182:     9      HA  -5.440000
## 183:     9      F9   8.263158
## 184:     9      YV   8.880952
## 185:     9      OO  -4.941176

Inspired by: https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html

Intermediate

‘:=’

Fast add, remove and update subsets of columns, by reference. := operator can be used in two ways: LHS := RHS form, and Functional form.

diamonds[ , `:=` ( avg_price = round( mean( price ), 2), avg_carat = round( mean( carat ), 2) ) , by = cut ]
##           z    y    x price table depth clarity color       cut carat
##     1: 2.43 3.98 3.95   326    55  61.5     SI2     E     Ideal  0.23
##     2: 2.31 3.84 3.89   326    61  59.8     SI1     E   Premium  0.21
##     3: 2.31 4.07 4.05   327    65  56.9     VS1     E      Good  0.23
##     4: 2.63 4.23 4.20   334    58  62.4     VS2     I   Premium  0.29
##     5: 2.75 4.35 4.34   335    58  63.3     SI2     J      Good  0.31
##    ---                                                               
## 53936: 3.50 5.76 5.75  2757    57  60.8     SI1     D     Ideal  0.72
## 53937: 3.61 5.75 5.69  2757    55  63.1     SI1     D      Good  0.72
## 53938: 3.56 5.68 5.66  2757    60  62.8     SI1     D Very Good  0.70
## 53939: 3.74 6.12 6.15  2757    58  61.0     SI2     H   Premium  0.86
## 53940: 3.64 5.87 5.83  2757    55  62.2     SI2     D     Ideal  0.75
##        avg_price avg_carat
##     1:   3457.54      0.70
##     2:   4584.26      0.89
##     3:   3928.86      0.85
##     4:   4584.26      0.89
##     5:   3928.86      0.85
##    ---                    
## 53936:   3457.54      0.70
## 53937:   3928.86      0.85
## 53938:   3981.76      0.81
## 53939:   4584.26      0.89
## 53940:   3457.54      0.70
vars = c( "price" , "carat" )
new_vars=paste0("avg_",vars)
diamonds[ , c( new_vars, "clarity" ) := list( round( avg_price, 2), round( avg_carat, 2 ) , as.character(clarity) ) ]
##           z    y    x price table depth clarity color       cut carat
##     1: 2.43 3.98 3.95   326    55  61.5     SI2     E     Ideal  0.23
##     2: 2.31 3.84 3.89   326    61  59.8     SI1     E   Premium  0.21
##     3: 2.31 4.07 4.05   327    65  56.9     VS1     E      Good  0.23
##     4: 2.63 4.23 4.20   334    58  62.4     VS2     I   Premium  0.29
##     5: 2.75 4.35 4.34   335    58  63.3     SI2     J      Good  0.31
##    ---                                                               
## 53936: 3.50 5.76 5.75  2757    57  60.8     SI1     D     Ideal  0.72
## 53937: 3.61 5.75 5.69  2757    55  63.1     SI1     D      Good  0.72
## 53938: 3.56 5.68 5.66  2757    60  62.8     SI1     D Very Good  0.70
## 53939: 3.74 6.12 6.15  2757    58  61.0     SI2     H   Premium  0.86
## 53940: 3.64 5.87 5.83  2757    55  62.2     SI2     D     Ideal  0.75
##        avg_price avg_carat
##     1:   3457.54      0.70
##     2:   4584.26      0.89
##     3:   3928.86      0.85
##     4:   4584.26      0.89
##     5:   3928.86      0.85
##    ---                    
## 53936:   3457.54      0.70
## 53937:   3928.86      0.85
## 53938:   3981.76      0.81
## 53939:   4584.26      0.89
## 53940:   3457.54      0.70

setkey

In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column.. The only other data.table operator that modifies input by reference is :=.

setkey() sorts a data.table and marks it as sorted (with an attribute sorted). The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always.

setkey(diamonds,carat)
setkeyv( diamonds, c( "carat", "cut" ) )

setnames

setnames( diamonds, c("avg_price","avg_carat" ),c("mean_price","mean_carat" ) )

%between%

Intended for use in i in [.data.table.

diamonds[price %between% c(300,500),.(carat,price,cut)]
##       carat price       cut
##    1:  0.20   367 Very Good
##    2:  0.20   345   Premium
##    3:  0.20   367   Premium
##    4:  0.20   367   Premium
##    5:  0.20   367   Premium
##   ---                      
## 1745:  0.40   491      Good
## 1746:  0.40   484     Ideal
## 1747:  0.41   467      Good
## 1748:  0.43   490      Good
## 1749:  0.43   452   Premium
range = data.table(lower = c(2000,3000), upper = c(2500,3500))
diamonds[price %inrange% range, .(carat,price,cut)]
##       carat price       cut
##    1:  0.30  2366 Very Good
##    2:  0.34  2160      Fair
##    3:  0.34  2287     Ideal
##    4:  0.34  2287     Ideal
##    5:  0.34  2346     Ideal
##   ---                      
## 5501:  1.50  3179      Good
## 5502:  1.50  3457   Premium
## 5503:  1.51  3497      Good
## 5504:  1.52  3105      Good
## 5505:  1.52  3105      Good

Special-symblos

.SD, .BY, .N, .I and .GRP are read only symbols for use in j. .N can be used in i as well

.SD

.SD is a data.table containing the Subset of x’s Data for each group, excluding any columns used in by (or keyby).

diamonds[,lapply(.SD,floor),.SDcols=c("x","y","z")]
##         x  y z
##     1:  3  3 2
##     2:  3  3 2
##     3:  3  3 2
##     4:  3  3 2
##     5:  3  3 2
##    ---        
## 53936: 10 10 6
## 53937: 10  9 6
## 53938: 10  9 6
## 53939: 10 10 6
## 53940: 10 10 6
newest_vars=paste0("new_mean_",vars)
diamonds[ , c(newest_vars) := lapply( .SD, function(x){ round( mean( x ) ,2 ) }) , by = cut,.SDcols=vars ]
##           z     y     x price table depth clarity color       cut carat
##     1: 2.36  3.71  3.74   367    59  63.4     VS2     E Very Good  0.20
##     2: 2.27  3.75  3.79   345    62  60.2     SI2     E   Premium  0.20
##     3: 2.26  3.77  3.79   367    62  59.8     VS2     E   Premium  0.20
##     4: 2.24  3.78  3.81   367    60  59.0     VS2     E   Premium  0.20
##     5: 2.32  3.78  3.81   367    59  61.1     VS2     E   Premium  0.20
##    ---                                                                 
## 53936: 6.17 10.10 10.14 15223    61  61.0      I1     I   Premium  4.01
## 53937: 6.24  9.94 10.02 15223    62  62.5      I1     J   Premium  4.01
## 53938: 6.43  9.85 10.00 17329    61  64.8      I1     H      Fair  4.13
## 53939: 6.72 10.16 10.23 18531    58  65.8      I1     J      Fair  4.50
## 53940: 6.98 10.54 10.74 18018    59  65.5      I1     J      Fair  5.01
##        mean_price mean_carat new_mean_price new_mean_carat
##     1:    3981.76       0.81        3981.76           0.81
##     2:    4584.26       0.89        4584.26           0.89
##     3:    4584.26       0.89        4584.26           0.89
##     4:    4584.26       0.89        4584.26           0.89
##     5:    4584.26       0.89        4584.26           0.89
##    ---                                                    
## 53936:    4584.26       0.89        4584.26           0.89
## 53937:    4584.26       0.89        4584.26           0.89
## 53938:    4358.76       1.05        4358.76           1.05
## 53939:    4358.76       1.05        4358.76           1.05
## 53940:    4358.76       1.05        4358.76           1.05

.BY

.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance. The by variables are also available to j directly by name; useful for example for titles of graphs if j is a plot command, or to branch with if() depending on the value of a group variable.

par(mfrow=c(2,4))
invisible( diamonds[,.(hist(price)),by=clarity] )

invisible( diamonds[,.(hist(price,main=paste(.BY))),by=clarity] )

par(mfrow=c(1,1))

.GRP

.GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.

diamonds[,.GRP,by=clarity]
##    clarity GRP
## 1:     VS2   1
## 2:     SI2   2
## 3:     SI1   3
## 4:    VVS2   4
## 5:     VS1   5
## 6:    VVS1   6
## 7:      IF   7
## 8:      I1   8

Data.table/Intermediate_Exercice

0.- Rename not_delayed column to i_hate_delays

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"      "i_hate_delays"

1.- Cast all the character filds tolower(). ( Of course in one line ;)

vars = c("carrier","tailnum","origin","dest")
##         year month day dep_time sched_dep_time dep_delay arr_time
##      1: 2013     1   1      517            515         2      830
##      2: 2013     1   1      533            529         4      850
##      3: 2013     1   1      542            540         2      923
##      4: 2013     1   1      544            545        -1     1004
##      5: 2013     1   1      554            600        -6      812
##     ---                                                          
## 336772: 2013     9  30       NA           1455        NA       NA
## 336773: 2013     9  30       NA           2200        NA       NA
## 336774: 2013     9  30       NA           1210        NA       NA
## 336775: 2013     9  30       NA           1159        NA       NA
## 336776: 2013     9  30       NA            840        NA       NA
##         sched_arr_time arr_delay carrier flight tailnum origin dest
##      1:            819        11      ua   1545  n14228    ewr  iah
##      2:            830        20      ua   1714  n24211    lga  iah
##      3:            850        33      aa   1141  n619aa    jfk  mia
##      4:           1022       -18      b6    725  n804jb    jfk  bqn
##      5:            837       -25      dl    461  n668dn    lga  atl
##     ---                                                            
## 336772:           1634        NA      9e   3393      NA    jfk  dca
## 336773:           2312        NA      9e   3525      NA    lga  syr
## 336774:           1330        NA      mq   3461  n535mq    lga  bna
## 336775:           1344        NA      mq   3572  n511mq    lga  cle
## 336776:           1020        NA      mq   3531  n839mq    lga  rdu
##         air_time distance hour minute           time_hour not_delayed
##      1:      227     1400    5     15 2013-01-01 05:00:00           0
##      2:      227     1416    5     29 2013-01-01 05:00:00           0
##      3:      160     1089    5     40 2013-01-01 05:00:00           0
##      4:      183     1576    5     45 2013-01-01 05:00:00           1
##      5:      116      762    6      0 2013-01-01 06:00:00           1
##     ---                                                              
## 336772:       NA      213   14     55 2013-09-30 14:00:00          NA
## 336773:       NA      198   22      0 2013-09-30 22:00:00          NA
## 336774:       NA      764   12     10 2013-09-30 12:00:00          NA
## 336775:       NA      419   11     59 2013-09-30 11:00:00          NA
## 336776:       NA      431    8     40 2013-09-30 08:00:00          NA

2.- Can you plot arr_delay vs dep_delay for each carrier wich has more than 25000 fligths in the same plot, in one line and… without using ggplot2? (muahahahahahaha)

par(mfrow=c(2,3))

## Empty data.table (0 rows) of 1 col: carrier
par(mfrow=c(1,1))

Advanced

shift

Lead or lag vectors, lists, data.frames or data.tables implemented in C for speed.

today=as.character(Sys.Date())
diamonds[,fake_data:= sample(seq(as.Date(today)-.N,as.Date(today),by="day"),.N)]
##           z     y     x price table depth clarity color       cut carat
##     1: 2.36  3.71  3.74   367    59  63.4     VS2     E Very Good  0.20
##     2: 2.27  3.75  3.79   345    62  60.2     SI2     E   Premium  0.20
##     3: 2.26  3.77  3.79   367    62  59.8     VS2     E   Premium  0.20
##     4: 2.24  3.78  3.81   367    60  59.0     VS2     E   Premium  0.20
##     5: 2.32  3.78  3.81   367    59  61.1     VS2     E   Premium  0.20
##    ---                                                                 
## 53936: 6.17 10.10 10.14 15223    61  61.0      I1     I   Premium  4.01
## 53937: 6.24  9.94 10.02 15223    62  62.5      I1     J   Premium  4.01
## 53938: 6.43  9.85 10.00 17329    61  64.8      I1     H      Fair  4.13
## 53939: 6.72 10.16 10.23 18531    58  65.8      I1     J      Fair  4.50
## 53940: 6.98 10.54 10.74 18018    59  65.5      I1     J      Fair  5.01
##        mean_price mean_carat new_mean_price new_mean_carat  fake_data
##     1:    3981.76       0.81        3981.76           0.81 1919-09-28
##     2:    4584.26       0.89        4584.26           0.89 1940-09-02
##     3:    4584.26       0.89        4584.26           0.89 1958-01-07
##     4:    4584.26       0.89        4584.26           0.89 1942-05-12
##     5:    4584.26       0.89        4584.26           0.89 1896-12-22
##    ---                                                               
## 53936:    4584.26       0.89        4584.26           0.89 1878-11-18
## 53937:    4584.26       0.89        4584.26           0.89 1946-03-04
## 53938:    4358.76       1.05        4358.76           1.05 1869-08-06
## 53939:    4358.76       1.05        4358.76           1.05 2009-09-25
## 53940:    4358.76       1.05        4358.76           1.05 1935-06-23
setkey(diamonds,"fake_data")
diamonds[, last_price := shift( price, n = 1 , type = "lag" ) ]
##           z    y    x price table depth clarity color       cut carat
##     1: 3.22 5.29 5.23  1346    55  61.2     SI1     F Very Good  0.54
##     2: 3.95 6.30 6.33  6154    54  62.5     VS2     F   Premium  1.00
##     3: 3.81 6.30 6.22  3691    62  60.9     SI2     D Very Good  0.94
##     4: 4.26 6.88 6.84  5797    56  62.1     SI1     I     Ideal  1.24
##     5: 2.70 4.43 4.40   786    57  61.2    VVS2     F     Ideal  0.32
##    ---                                                               
## 53936: 2.68 4.29 4.31  1013    55  62.3    VVS1     G     Ideal  0.30
## 53937: 3.61 5.91 5.87  3584    55  61.3     SI1     E     Ideal  0.76
## 53938: 3.24 5.25 5.28  1881    56  61.5     VS2     E   Premium  0.55
## 53939: 2.50 4.18 4.15   425    56  60.0     VS1     D Very Good  0.26
## 53940: 4.95 8.41 8.45 15198    60  58.7     SI2     D   Premium  2.11
##        mean_price mean_carat new_mean_price new_mean_carat  fake_data
##     1:    3981.76       0.81        3981.76           0.81 1869-06-23
##     2:    4584.26       0.89        4584.26           0.89 1869-06-24
##     3:    3981.76       0.81        3981.76           0.81 1869-06-25
##     4:    3457.54       0.70        3457.54           0.70 1869-06-26
##     5:    3457.54       0.70        3457.54           0.70 1869-06-27
##    ---                                                               
## 53936:    3457.54       0.70        3457.54           0.70 2017-02-23
## 53937:    3457.54       0.70        3457.54           0.70 2017-02-24
## 53938:    4584.26       0.89        4584.26           0.89 2017-02-25
## 53939:    3981.76       0.81        3981.76           0.81 2017-02-26
## 53940:    4584.26       0.89        4584.26           0.89 2017-02-27
##        last_price
##     1:         NA
##     2:       1346
##     3:       6154
##     4:       3691
##     5:       5797
##    ---           
## 53936:       4480
## 53937:       1013
## 53938:       3584
## 53939:       1881
## 53940:        425
diamonds[, next_price := shift( price, n = 1 , type = "lead" ), by = cut ]
##           z    y    x price table depth clarity color       cut carat
##     1: 3.22 5.29 5.23  1346    55  61.2     SI1     F Very Good  0.54
##     2: 3.95 6.30 6.33  6154    54  62.5     VS2     F   Premium  1.00
##     3: 3.81 6.30 6.22  3691    62  60.9     SI2     D Very Good  0.94
##     4: 4.26 6.88 6.84  5797    56  62.1     SI1     I     Ideal  1.24
##     5: 2.70 4.43 4.40   786    57  61.2    VVS2     F     Ideal  0.32
##    ---                                                               
## 53936: 2.68 4.29 4.31  1013    55  62.3    VVS1     G     Ideal  0.30
## 53937: 3.61 5.91 5.87  3584    55  61.3     SI1     E     Ideal  0.76
## 53938: 3.24 5.25 5.28  1881    56  61.5     VS2     E   Premium  0.55
## 53939: 2.50 4.18 4.15   425    56  60.0     VS1     D Very Good  0.26
## 53940: 4.95 8.41 8.45 15198    60  58.7     SI2     D   Premium  2.11
##        mean_price mean_carat new_mean_price new_mean_carat  fake_data
##     1:    3981.76       0.81        3981.76           0.81 1869-06-23
##     2:    4584.26       0.89        4584.26           0.89 1869-06-24
##     3:    3981.76       0.81        3981.76           0.81 1869-06-25
##     4:    3457.54       0.70        3457.54           0.70 1869-06-26
##     5:    3457.54       0.70        3457.54           0.70 1869-06-27
##    ---                                                               
## 53936:    3457.54       0.70        3457.54           0.70 2017-02-23
## 53937:    3457.54       0.70        3457.54           0.70 2017-02-24
## 53938:    4584.26       0.89        4584.26           0.89 2017-02-25
## 53939:    3981.76       0.81        3981.76           0.81 2017-02-26
## 53940:    4584.26       0.89        4584.26           0.89 2017-02-27
##        last_price next_price
##     1:         NA       3691
##     2:       1346      11127
##     3:       6154        529
##     4:       3691        786
##     5:       5797       2064
##    ---                      
## 53936:       4480       3584
## 53937:       1013         NA
## 53938:       3584      15198
## 53939:       1881         NA
## 53940:        425         NA

%like%

Intended for use in i in [.data.table.

diamonds[cut %like% "Good$" ]
##           z    y    x price table depth clarity color       cut carat
##     1: 3.22 5.29 5.23  1346    55  61.2     SI1     F Very Good  0.54
##     2: 3.81 6.30 6.22  3691    62  60.9     SI2     D Very Good  0.94
##     3: 2.74 4.53 4.47   529    54  60.9      IF     J Very Good  0.33
##     4: 4.27 6.68 6.72  6344    52  63.7     SI2     D      Good  1.20
##     5: 3.53 5.54 5.61  2198    60  63.3     SI1     F Very Good  0.71
##    ---                                                               
## 16984: 3.62 5.86 5.82  2422    57  62.0     SI2     G Very Good  0.76
## 16985: 4.55 7.29 7.25 14976    57  62.6     VS1     F Very Good  1.51
## 16986: 3.39 5.85 5.83  2559    61  58.0     VS1     F      Good  0.71
## 16987: 4.59 7.32 7.30 12648    56  62.8    VVS2     H Very Good  1.52
## 16988: 2.50 4.18 4.15   425    56  60.0     VS1     D Very Good  0.26
##        mean_price mean_carat new_mean_price new_mean_carat  fake_data
##     1:    3981.76       0.81        3981.76           0.81 1869-06-23
##     2:    3981.76       0.81        3981.76           0.81 1869-06-25
##     3:    3981.76       0.81        3981.76           0.81 1869-06-28
##     4:    3928.86       0.85        3928.86           0.85 1869-07-08
##     5:    3981.76       0.81        3981.76           0.81 1869-07-11
##    ---                                                               
## 16984:    3981.76       0.81        3981.76           0.81 2017-02-16
## 16985:    3981.76       0.81        3981.76           0.81 2017-02-17
## 16986:    3928.86       0.85        3928.86           0.85 2017-02-19
## 16987:    3981.76       0.81        3981.76           0.81 2017-02-20
## 16988:    3981.76       0.81        3981.76           0.81 2017-02-26
##        last_price next_price
##     1:         NA       3691
##     2:       6154        529
##     3:        786       2198
##     4:       5075        882
##     5:       6505       2574
##    ---                      
## 16984:       6727      14976
## 16985:       2422      12648
## 16986:        596         NA
## 16987:       2559        425
## 16988:       1881         NA

Merge

Fast merge of two data.tables. The data.table method behaves very similarly to that of data.frames except that, by default, it attempts to merge

- at first based on the shared key columns, and if there are none,
- then based on key columns of the first argument x, and if there are none,
- then based on the common columns between the two data.tables.
X = data.table(x = c("a","a","a","a","b","b","c","d"), some_staf = 1:8, key = "x")
Y = data.table(x = c("a","c","z"), z = c(100,500,1000), key = "x")
# You can do it two steps setkey(X, x)
X[Y]
##    x some_staf    z
## 1: a         1  100
## 2: a         2  100
## 3: a         3  100
## 4: a         4  100
## 5: c         7  500
## 6: z        NA 1000
Y[X]
##    x   z some_staf
## 1: a 100         1
## 2: a 100         2
## 3: a 100         3
## 4: a 100         4
## 5: b  NA         5
## 6: b  NA         6
## 7: c 500         7
## 8: d  NA         8
merge(X,Y,all=T)
##    x some_staf    z
## 1: a         1  100
## 2: a         2  100
## 3: a         3  100
## 4: a         4  100
## 5: b         5   NA
## 6: b         6   NA
## 7: c         7  500
## 8: d         8   NA
## 9: z        NA 1000
merge(X,Y,all=F)
##    x some_staf   z
## 1: a         1 100
## 2: a         2 100
## 3: a         3 100
## 4: a         4 100
## 5: c         7 500

Other functions

fread()

THE BEST FUNCTION FOR LOADING .CSV FILES

fwrite()

As write.csv but much faster (e.g. 2 seconds versus 1 minute) and just as flexible.

melt & dcast

Data.table has optimized function for melting and casting a data.table.

IDate & ITime

IDateTime

system.time(IDateTime(flights[["time_hour"]]))
##    user  system elapsed 
##   0.038   0.012   0.049
nrow(flights)
## [1] 336776
#CRY

magrittr

Pipe operators for data.table

library(magrittr)
diamonds %>% setDT() %>% 
  .[, .(mean_carat=mean(carat),sd_carat=sd(carat)), by=.(cut,color)] %>%
  .[order(mean_carat)] %>% 
  .[1:3]
##      cut color mean_carat  sd_carat
## 1: Ideal     D  0.5657657 0.2993503
## 2: Ideal     E  0.5784012 0.3125406
## 3: Ideal     F  0.6558285 0.3745245

PIPES outside the hadleyverse!

https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html

Data.table/Advanced_Exercice

1.- magrittr way! Calculate the mean of not delayed flights by company. Use the not_delayed column and name it as perc_not_delayed! Use the chaining to order by perc_not_delayed. (Add .N column)

##     carrier perc_not_delayed     N
##  1:      ha        0.7573099   342
##  2:      us        0.7276707 20536
##  3:      oo        0.6896552    32
##  4:      as        0.6432584   714
##  5:      aa        0.6332845 32729
##  6:      mq        0.6288996 26397
##  7:      dl        0.6208831 48110
##  8:      9e        0.5579927 18460
##  9:      b6        0.5529362 54635
## 10:      yv        0.5394495   601
## 11:      ev        0.5171353 54173
## 12:      vx        0.4862600  5162
## 13:      ua        0.4712223 58665
## 14:      fl        0.4395984  3260
## 15:      f9        0.4354839   685
## 16:      wn        0.3755690 12275

2.-for each fligth can you tell wich was the last carrier that did the same travel?

require(lubridate)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
flights[,dep_time_stamp:=ymd_hm(paste(year,month,day,hour,minute))]
##         year month day dep_time sched_dep_time dep_delay arr_time
##      1: 2013     1   1      517            515         2      830
##      2: 2013     1   1      533            529         4      850
##      3: 2013     1   1      542            540         2      923
##      4: 2013     1   1      544            545        -1     1004
##      5: 2013     1   1      554            600        -6      812
##     ---                                                          
## 336772: 2013     9  30       NA           1455        NA       NA
## 336773: 2013     9  30       NA           2200        NA       NA
## 336774: 2013     9  30       NA           1210        NA       NA
## 336775: 2013     9  30       NA           1159        NA       NA
## 336776: 2013     9  30       NA            840        NA       NA
##         sched_arr_time arr_delay carrier flight tailnum origin dest
##      1:            819        11      ua   1545  n14228    ewr  iah
##      2:            830        20      ua   1714  n24211    lga  iah
##      3:            850        33      aa   1141  n619aa    jfk  mia
##      4:           1022       -18      b6    725  n804jb    jfk  bqn
##      5:            837       -25      dl    461  n668dn    lga  atl
##     ---                                                            
## 336772:           1634        NA      9e   3393      NA    jfk  dca
## 336773:           2312        NA      9e   3525      NA    lga  syr
## 336774:           1330        NA      mq   3461  n535mq    lga  bna
## 336775:           1344        NA      mq   3572  n511mq    lga  cle
## 336776:           1020        NA      mq   3531  n839mq    lga  rdu
##         air_time distance hour minute           time_hour not_delayed
##      1:      227     1400    5     15 2013-01-01 05:00:00           0
##      2:      227     1416    5     29 2013-01-01 05:00:00           0
##      3:      160     1089    5     40 2013-01-01 05:00:00           0
##      4:      183     1576    5     45 2013-01-01 05:00:00           1
##      5:      116      762    6      0 2013-01-01 06:00:00           1
##     ---                                                              
## 336772:       NA      213   14     55 2013-09-30 14:00:00          NA
## 336773:       NA      198   22      0 2013-09-30 22:00:00          NA
## 336774:       NA      764   12     10 2013-09-30 12:00:00          NA
## 336775:       NA      419   11     59 2013-09-30 11:00:00          NA
## 336776:       NA      431    8     40 2013-09-30 08:00:00          NA
##             N      dep_time_stamp
##      1: 58665 2013-01-01 05:15:00
##      2: 58665 2013-01-01 05:29:00
##      3: 32729 2013-01-01 05:40:00
##      4: 54635 2013-01-01 05:45:00
##      5: 48110 2013-01-01 06:00:00
##     ---                          
## 336772: 18460 2013-09-30 14:55:00
## 336773: 18460 2013-09-30 22:00:00
## 336774: 26397 2013-09-30 12:10:00
## 336775: 26397 2013-09-30 11:59:00
## 336776: 26397 2013-09-30 08:40:00
##              dep_time_stamp origin dest carrier last_carrier
##      1: 2013-01-01 13:17:00    ewr  alb      ev           NA
##      2: 2013-01-01 16:21:00    ewr  alb      ev           ev
##      3: 2013-01-01 20:04:00    ewr  alb      ev           ev
##      4: 2013-01-02 13:27:00    ewr  alb      ev           ev
##      5: 2013-01-02 16:21:00    ewr  alb      ev           ev
##     ---                                                     
## 336772: 2013-12-27 07:15:00    lga  xna      mq           mq
## 336773: 2013-12-29 15:20:00    lga  xna      mq           mq
## 336774: 2013-12-30 07:15:00    lga  xna      mq           mq
## 336775: 2013-12-30 15:20:00    lga  xna      mq           mq
## 336776: 2013-12-31 07:15:00    lga  xna      mq           mq

3.- Merge flights with airlines. The data.table way! setkey and merge!

data("airlines",package = "nycflights13" )
setDT(airlines)
##    year month day dep_time sched_dep_time dep_delay arr_time
## 1:   NA    NA  NA       NA             NA        NA       NA
## 2:   NA    NA  NA       NA             NA        NA       NA
## 3:   NA    NA  NA       NA             NA        NA       NA
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:             NA        NA      9E     NA      NA     NA   NA       NA
## 2:             NA        NA      AA     NA      NA     NA   NA       NA
## 3:             NA        NA      AS     NA      NA     NA   NA       NA
##    distance hour minute time_hour not_delayed  N dep_time_stamp
## 1:       NA   NA     NA      <NA>          NA NA           <NA>
## 2:       NA   NA     NA      <NA>          NA NA           <NA>
## 3:       NA   NA     NA      <NA>          NA NA           <NA>
##    last_carrier                   name
## 1:           NA      Endeavor Air Inc.
## 2:           NA American Airlines Inc.
## 3:           NA   Alaska Airlines Inc.

4.- Use the function IDateTime to create simultaniously two columns. First, create a vector with the two column names. Second DT[,():=IDateTime(...)] notation.

##         year month day dep_time sched_dep_time dep_delay arr_time
##      1: 2013     1   1      810            810         0     1048
##      2: 2013     1   1     1452           1455        -3     1637
##      3: 2013     1   1     1451           1500        -9     1634
##      4: 2013     1   1     1454           1500        -6     1635
##      5: 2013     1   1     1507           1515        -8     1651
##     ---                                                          
## 336772: 2013    12  27     1147           1150        -3     1342
## 336773: 2013    12  28     1540           1535         5     1708
## 336774: 2013    12  30     1143           1150        -7     1348
## 336775: 2013    12  31     1143           1150        -7     1341
## 336776: 2013    12  31     1430           1432        -2     1546
##         sched_arr_time arr_delay carrier flight tailnum origin dest
##      1:           1037        11      9e   3538  n915xj    jfk  msp
##      2:           1639        -2      9e   3295  n920xj    jfk  buf
##      3:           1636        -2      9e   4105  n8444f    jfk  iad
##      4:           1636        -1      9e   3843  n8409n    jfk  syr
##      5:           1656        -5      9e   3792  n8631e    jfk  roc
##     ---                                                            
## 336772:           1406       -24      yv   2885  n905fj    lga  clt
## 336773:           1700         8      yv   2889  n922fj    lga  phl
## 336774:           1406       -18      yv   2885  n918fj    lga  clt
## 336775:           1406       -25      yv   2885  n911fj    lga  clt
## 336776:           1555        -9      yv   3771  n515mj    lga  iad
##         air_time distance hour minute           time_hour not_delayed
##      1:      189     1029    8     10 2013-01-01 08:00:00           0
##      2:       68      301   14     55 2013-01-01 14:00:00           1
##      3:       57      228   15      0 2013-01-01 15:00:00           1
##      4:       57      209   15      0 2013-01-01 15:00:00           1
##      5:       66      264   15     15 2013-01-01 15:00:00           1
##     ---                                                              
## 336772:       98      544   11     50 2013-12-27 11:00:00           1
## 336773:       32       96   15     35 2013-12-28 15:00:00           0
## 336774:       95      544   11     50 2013-12-30 11:00:00           1
## 336775:      103      544   11     50 2013-12-31 11:00:00           1
## 336776:       52      229   14     32 2013-12-31 14:00:00           1
##             N      dep_time_stamp last_carrier day_flight time_flight
##      1: 18460 2013-01-01 08:10:00           NA 2013-01-01    08:00:00
##      2: 18460 2013-01-01 14:55:00           b6 2013-01-01    14:00:00
##      3: 18460 2013-01-01 15:00:00           ev 2013-01-01    15:00:00
##      4: 18460 2013-01-01 15:00:00           b6 2013-01-01    15:00:00
##      5: 18460 2013-01-01 15:15:00           b6 2013-01-01    15:00:00
##     ---                                                              
## 336772:   601 2013-12-27 11:50:00           us 2013-12-27    11:00:00
## 336773:   601 2013-12-28 15:35:00           us 2013-12-28    15:00:00
## 336774:   601 2013-12-30 11:50:00           us 2013-12-30    11:00:00
## 336775:   601 2013-12-31 11:50:00           us 2013-12-31    11:00:00
## 336776:   601 2013-12-31 14:32:00           ev 2013-12-31    14:00:00

Last words!

Thanks!


Availables for further questions!

Jordi Puigdellívol: jpuigde@gmail.com, https://es.linkedin.com/in/jordi-puigdellivol-654b5265


Carlos Bort: carlosebort@gmail.com, https://www.linkedin.com/in/carlosbort